Published 05/21/2004
Reagan Moore, Distinguished Scientist and co-director of the Data and Knowledge Systems program at the San Diego Supercomputer Center (SDSC) at UC San Diego, has been invited to deliver a keynote address at the 19th International Supercomputer Conference 2004 (ISC2004), to be held in Heidelberg, Germany, June 22-25.
In his talk on "Integrating Data and Information Management," Moore will look at the emerging approach of using "virtualization mechanisms" for the management, analysis, and preservation of distributed digital data. "Virtualization mechanisms rely on the separation of content management from context management from knowledge management," explains Moore. "This approach is being used to implement a common architecture useful for data grids for the sharing of data, digital libraries for the publication of data, and persistent archives for the preservation of data and the management of technology evolution." The goal is to build a data management environment in which all aspects of data discovery, manipulation, and preservation can be automated, and driven from an application.
Under the three-part theme of "Applications, Architectures, Trends," the International Supercomputer Conference 2004 will offer three days of intensive technical sessions, with international experts discussing real-world issues in high performance computing (HPC). Each day of the conference program will include a keynote speaker on the theme for that day, along with other presentations on related topics.
Day one will feature "Applications," with a keynote address by Steve J. Wallach of Chiaro Networks discussing "The Search for the Softron - Will We Be Able to Develop Software for Petaflop/s Computing?" Day two will focus on "Architectures," with Bill Camp of Sandia National Laboratories and Fred Weber of AMD giving a keynote presentation on "The Red Storm Project - History and Anatomy of a Supercomputer." SDSC's Moore will give the keynote talk on the third day, which will focus on "Trends." Complete information about the conference can be found at http://www.isc2004.org/. -Paul Tooby
In the following interview, Moore gives Supercomputing Online readers a look at some of the challenges of "Integrating Data and Information Management" as dispersed research institutions increasingly work toward making data easily accessible from multiple sites.
SC Online: As a prelude to your keynote presentation at the ISC2004 conference this June in Heidelberg, can you please give your perspective on where we are today in this field?
Moore: At SDSC, we have been quite successful in managing distributed data, and believe that most of the basic concepts are understood (virtualization methods for infrastructure independence, latency management, federation of data grids) and implemented in data grid technology. We note that in order to manage distributed data, we had to associate information with each digital entity, which meant organizing and managing the administrative attributes in a database. In short, to manage data we had to build catalogs containing administrative information.
Grid data management systems for accessing distributed data are taking a similar approach, and must choose how to organize and manage both the data and the metadata.
Current information management technologies only address part of the requirements. Information is the assertion that a semantic label or name can be applied to a digital entity. Database technology manages semantic labels and the associated data, but does not provide a way to manage the assertion on which the semantic label was based. We encounter a similar problem when federating data grids. We need to be able to specify the access and update constraints that each data grid imposes on its name spaces. The management of information requires the ability to manage the relationships and rules on which the assertions are based.
We are just starting to understand the concept of information, and how information is differentiated from knowledge, for implementation in data management systems.
SC Online: Can you elaborate on the virtualization mechanisms you'll be discussing in your presentation? Examples?
Moore: Data virtualization provides the ability to access data without knowing its storage location. Data virtualization is based on the ability to create an infrastructure-independent naming convention. In data grids, this is called a logical file name. We can then map administrative attributes onto the logical file name such as the physical location of the file and the name of the file on that particular storage system. We can also associate the location of copies (replicas) with the logical name. If we map access controls onto the logical name, then when we move the file the access controls do not change. We can map descriptive attributes onto the logical name, and discover files without knowing their name or location.
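As a rough sketch of the idea only, and not the actual data grid software developed at SDSC, a logical name space might look like the following in Python; the class and method names are hypothetical illustrations.

    from dataclasses import dataclass, field

    @dataclass
    class Replica:
        storage_system: str   # e.g. a storage host name (hypothetical)
        physical_path: str    # the file's name on that particular storage system

    @dataclass
    class LogicalFile:
        logical_name: str                                     # infrastructure-independent name
        replicas: list = field(default_factory=list)          # physical copies
        access_controls: dict = field(default_factory=dict)   # user -> permission
        descriptive: dict = field(default_factory=dict)       # attributes for discovery

    class LogicalNameSpace:
        def __init__(self):
            self.catalog = {}  # logical name -> LogicalFile

        def register(self, logical_name):
            self.catalog[logical_name] = LogicalFile(logical_name)

        def add_replica(self, logical_name, storage_system, physical_path):
            self.catalog[logical_name].replicas.append(Replica(storage_system, physical_path))

        def move_replica(self, logical_name, index, new_system, new_path):
            # Moving the file changes only the physical mapping; access controls
            # and descriptive attributes stay attached to the logical name.
            self.catalog[logical_name].replicas[index] = Replica(new_system, new_path)

        def discover(self, **attrs):
            # Find files by descriptive attributes, without knowing their name or location.
            return [f for f in self.catalog.values()
                    if all(f.descriptive.get(k) == v for k, v in attrs.items())]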
Storage virtualization is the ability to access data through your preferred access method, instead of having to use the particular protocol implemented by the remote storage repository. Storage virtualization is based on the management of a standard set of operations for remote manipulation (UNIX file-based access, latency management, administrative metadata manipulation, etc.). Federated servers that are installed at each storage repository can then translate from the required local protocol to a standard access protocol. An example is implementing servers that allow read/write operations to work on files, blobs in databases, files in archives, objects in object ring buffers, etc.
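Again purely as a hypothetical sketch, not the server interface of any real data grid, a standard set of storage operations with a translating driver per repository type might be expressed as follows.

    from abc import ABC, abstractmethod
    import os

    class StorageDriver(ABC):
        """Standard operations that every storage repository must support."""
        @abstractmethod
        def read(self, physical_path, offset, length): ...
        @abstractmethod
        def write(self, physical_path, offset, data): ...
        @abstractmethod
        def stat(self, physical_path): ...

    class UnixFileDriver(StorageDriver):
        """Translates the standard operations to ordinary UNIX file access."""
        def read(self, physical_path, offset, length):
            with open(physical_path, "rb") as f:
                f.seek(offset)
                return f.read(length)

        def write(self, physical_path, offset, data):
            with open(physical_path, "r+b") as f:
                f.seek(offset)
                f.write(data)

        def stat(self, physical_path):
            s = os.stat(physical_path)
            return {"size": s.st_size, "mtime": s.st_mtime}

    # A tape archive, a database blob store, or an object ring buffer would each
    # supply its own driver; clients only ever see the standard operations.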
Information repository virtualization is the ability to manipulate a catalog that is stored in a database. Again, it is a standard set of operations that are used to support loading of XML files, export of XML files, bulk load, bulk unload, schema extension, automated SQL creation, etc. This makes it possible for a data grid to store administrative metadata in any database.
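A similar illustrative sketch for the metadata catalog, using SQLite purely as a stand-in for whatever database a grid chooses; the table layout and method names are assumptions, not a real catalog schema.

    import sqlite3

    class MetadataCatalog:
        """Standard catalog operations translated onto a particular database."""
        def __init__(self, db_path=":memory:"):
            self.db = sqlite3.connect(db_path)
            self.db.execute("CREATE TABLE IF NOT EXISTS metadata (logical_name TEXT)")
            self.columns = {"logical_name"}

        def extend_schema(self, attribute):
            # Schema extension: add a new administrative attribute on demand.
            if attribute not in self.columns:
                self.db.execute("ALTER TABLE metadata ADD COLUMN %s TEXT" % attribute)
                self.columns.add(attribute)

        def bulk_load(self, records):
            # Automated SQL creation from the attribute names in each record.
            for rec in records:
                for attr in rec:
                    self.extend_schema(attr)
                cols = ", ".join(rec)
                placeholders = ", ".join("?" for _ in rec)
                self.db.execute(
                    "INSERT INTO metadata (%s) VALUES (%s)" % (cols, placeholders),
                    tuple(rec.values()))
            self.db.commit()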
Access virtualization is the ability to translate from your favorite access mechanism (C library call, C++, Perl, Python, shell command, Java, OAI, Web browser, etc.) to the standard operations supported by storage and information repository virtualization. Web service environments are an example of an access mechanism.
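To tie the sketches together, access virtualization can be pictured as a thin front end that resolves a logical name and then invokes the same standard operations; the GridClient and grid_get names below are invented for the example.

    class GridClient:
        """A Python front end; a shell command or web form could wrap the same call."""
        def __init__(self, name_space, drivers):
            self.ns = name_space    # a LogicalNameSpace, as in the earlier sketch
            self.drivers = drivers  # storage system name -> StorageDriver

        def get(self, logical_name):
            # Resolve the logical name to a replica, then use the standard
            # read operation of whichever driver holds that replica.
            replica = self.ns.catalog[logical_name].replicas[0]
            driver = self.drivers[replica.storage_system]
            return driver.read(replica.physical_path, 0, -1)  # -1 = read to end

    def grid_get(client, logical_name):
        """A shell-command-style wrapper over the same standard operation."""
        return client.get(logical_name)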
SC Online: Where do you see things heading over the next year or two? What is your vision of data grids, say, five years out?
Moore: Data grids provide the infrastructure for assembling collections of data that are distributed across multiple sites. They separate the management of content (digital entities) from the management of context (metadata attributes). I expect there to be multiple approaches to solving this problem, with the number of approaches growing each year. In the next two years, we will see an integration of data grids with digital libraries and persistent archives, and the creation of federated environments for the sharing, publication, and preservation of data. In the next five years, we will see technologies for the management of relationships between semantic labels and the integration of knowledge management with data and information management. As these multiple approaches progress, they will begin to create a common infrastructure which others will then begin to use, much as is the case with XML.
For instance, we need to manage the consistency between content and context for data distributed across different storage architectures. Data grids implement logical name spaces for storage resources, users, and files. Data grids provide the required consistency by updating administrative metadata after every operation. However, when data grids are federated, we also need to specify the consistency constraints for sharing data and updating metadata between the grids. The grids must agree on who controls changes to the administrative metadata, who controls replication of files, and who controls access to the storage repositories. The consistency constraints are rules applied to the logical name spaces that describe how the information is managed, or the rules and relationships that describe its use. We see the need to describe a context not only for data, but also a context for information (the naming convention or semantic labels). The rules and relationships that specify federation constraints are one form of information context. The logical relationships used to explain the meaning of a term are another form of information context. Information context is a form of knowledge.
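One way to picture such federation constraints, using invented grid and operation names, is as a small rule table applied to cross-grid requests against the logical name spaces.

    # Cross-grid constraints as rules on the logical name spaces.  The grid
    # names and operation names below are made up for the example.
    FEDERATION_POLICY = {
        # (owning grid, requesting grid) -> operations the owner permits
        ("grid-A", "grid-B"): {"read", "replicate"},
        ("grid-B", "grid-A"): {"read", "update-metadata"},
    }

    def is_allowed(owner, requester, operation):
        """Check a cross-grid request against the agreed consistency constraints."""
        if owner == requester:
            return True  # each grid controls its own name spaces
        return operation in FEDERATION_POLICY.get((owner, requester), set())

    # grid-B may replicate grid-A's files, but may not change grid-A's
    # administrative metadata.
    assert is_allowed("grid-A", "grid-B", "replicate")
    assert not is_allowed("grid-A", "grid-B", "update-metadata")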
To build a data grid, you assign and manage information about each digital entity: when it was created, who owns it, and so on. You build metadata catalogs to manage the data context. To federate data grids, you need to coordinate metadata management across multiple repositories, by specifying rules for both access and data context update. It's a major challenge learning how to characterize and manage these relationships. We need to come up with a way to characterize and manage information context. The emerging Web Ontology Language (OWL) may provide a way to apply the relationships that define the information context.
We also expect to see massive collections, with the need to manipulate millions of files. Web service environments will be combined with dataflow systems in the future to support the manipulation of large numbers of digital entities or large numbers of result sets. Based on prior experience with data grids, the ability to manage bulk operations will be essential for viable web services.
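The arithmetic behind that point can be made concrete with a small illustrative sketch (the 50-millisecond per-request latency is an assumption, not a measurement).

    import math

    def request_overhead(n_files, latency_seconds=0.05, batch_size=1):
        """Approximate total request latency for registering n_files entries."""
        n_requests = math.ceil(n_files / batch_size)
        return n_requests * latency_seconds

    # One request per file for a million files: ~50,000 seconds (about 14 hours)
    # of request overhead alone; batches of 10,000: 100 requests, ~5 seconds.
    print(request_overhead(1_000_000))                      # per-file registration
    print(request_overhead(1_000_000, batch_size=10_000))   # bulk registration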
SC Online: What are the obstacles to getting there? Are they technical, organizational or sociological?
Moore: The major obstacle is identifying an appropriate syntax for managing relationships (i.e. knowledge). In practice, we use different software infrastructure for each type of knowledge relationship (digital library cross-walks and OWL for logical relationships, GIS systems for spatial relationships, workflows for procedural relationships, data analysis systems for functional relationships). We make progress (create new knowledge) when we are able to apply multiple types of relationships to a collection and identify situations where multiple types of relationships simultaneously apply. We create a name (semantic label) to describe the detected simultaneous satisfaction of the set of relationships.
When we work with scientific communities, they assign meanings to their semantic labels that represent their interpretation of reality. Effectively, they create a set of relationships that define when a given semantic label can be applied. They augment the set of rules for applying a semantic label that are used within their discipline with the set of rules they need to express their new research. They usually assert that their interpretation allows them to make inferences that other interpretations miss. The terms in use within the discipline to discuss the general case are frequently too restrictive to describe their new research results.
Here are a couple of examples. Two different communities involved in sharing information are the digital library community and the preservation community. As they meet in the digital world, they have to reconcile the terms they use for information. Although they are in similar fields, each community uses different terminology. After they reach agreement, they will then need to reconcile their terms with those used by computer scientists.
In the scientific arena, we use data grids to build data sharing environments for multiple scientific disciplines. But each discipline typically uses a different set of descriptive metadata that is not understood by others. This is true even for sub-disciplines within a scientific domain. Each group needs to be able to characterize when they associate a different meaning with a semantic label.
If we take a step back, the question is whether there is a definition for the management of data, information, and knowledge that can be used across all of these communities and that makes it simpler to share or publish data. We have been trying to decide on an appropriate way to simplify data management in the digital world. In doing that, we have created our own interpretation of the meaning of data, information, and knowledge.
To make progress, research groups need the freedom to add nuance of meaning, and the tools to describe the nuance of meaning that represents their research result. Every semantic label has a context. We will be able to make more rapid progress when we are able to express information context.
SC Online: Is there a chance to have standards for data semantics and metadata?
Moore: Semantic meaning depends upon the context in which the label was applied. The terms we used yesterday will have a different meaning tomorrow, because we will associate them with a different set of words. For example, in grammar school the word "sun" is associated with yellow, light, hot. In college, the word "sun" is associated with spectrum, photons, star. The same word evokes different explanatory terms.
My favorite example is the term "Persistent Archive." To an archivist, a persistent archive is the collection that is being preserved. To a supercomputer manager, a persistent archive is the infrastructure that is used to preserve a collection. The real meaning of the term is the combination of the interpretations by each community.
When we can define the relationships (semantic, temporal, spatial) that underlie the meaning we associate with a term, we can express a standard for metadata.
SC Online: Are data grids in competition with global filesystems or are they complementary?
Moore: Data grids provide a way to organize and describe data that may be stored in global filesystems. In this sense, data grids are a generalization of global filesystems. Data grids work across all types of storage systems, not just filesystems. Data grids manage a context that includes administrative, descriptive, and integrity attributes. Data grids manage access latency for distributed environments. Data grids are a mechanism for managing data stored in global filesystems, databases, and archives, while allowing each type of storage system to maintain control over the management of the local hardware resource.
SC Online: In closing, is there anything you'd like to cover that we haven't talked about yet?
Moore: We encounter multiple scientific disciplines that either have or are assembling petabyte-sized collections. They include Earth Observing satellite data (2 PB today), weather and climate simulation (1 PB today), high energy physics (2 PB today), radio astronomy (a petabyte today), synoptic surveys in astronomy (2 PB per year), bio-medical imaging, etc. The management of data, the application of semantic labels to features in the data, and the tracking of the processing rules that govern how those labels are applied will all have to be automated. We need mechanisms to automate curation processes for digital libraries, archival processes for persistent archives, and analysis processes for scientific research.
SC Online: One final question - How will we get buy-in to this?
Moore: By having infrastructure that works.
International Supercomputer Conference 2004
ISC2004
http://www.isc2004.org/
Data and Knowledge Systems (DAKS)
http://daks.sdsc.edu/